CL-XABSA: Contrastive Learning for Cross-lingual Aspect-based Sentiment Analysis
As an extensively researched task in natural language processing (NLP),
aspect-based sentiment analysis (ABSA) predicts the sentiment expressed in a
text towards a given aspect. Unfortunately, most languages lack sufficient
annotation resources, so recent research has increasingly focused on
cross-lingual aspect-based sentiment analysis (XABSA). However, most recent
studies concentrate on cross-lingual data alignment rather than model
alignment. To this end, we propose a novel
framework, CL-XABSA: Contrastive Learning for Cross-lingual Aspect-Based
Sentiment Analysis. Specifically, we design two contrastive strategies,
token-level contrastive learning of token embeddings (TL-CTE) and
sentiment-level contrastive learning of token embeddings (SL-CTE), to
regularize the semantic space of the source and target languages to be more
uniform. Since our framework can receive datasets in multiple languages during
training, it can be adapted not only to the XABSA task but also to multilingual
aspect-based sentiment analysis (MABSA). To further improve performance, we
apply knowledge distillation, leveraging unlabeled target-language data. For
the distillation XABSA task, we further compare the effectiveness of different
training data (the source dataset, a translated dataset, and a code-switched
dataset). The results demonstrate that the proposed method brings improvements
on all three tasks: XABSA, distillation XABSA, and MABSA.
For reproducibility, our code for this paper is available at
https://github.com/GKLMIP/CL-XABSA
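The token-level idea can be pictured with a minimal sketch: an InfoNCE-style loss that pulls each source-language token embedding towards its aligned target-language counterpart while pushing it away from the other tokens in the batch. The pairing assumption, temperature value, and NumPy implementation below are illustrative choices, not the paper's exact TL-CTE formulation.

```python
import numpy as np

def token_contrastive_loss(src, tgt, temperature=0.1):
    """InfoNCE-style loss over token embeddings: positives are aligned
    source/target token pairs on the diagonal of the similarity matrix."""
    # L2-normalise so dot products are cosine similarities
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    logits = src @ tgt.T / temperature          # (n, n) similarity matrix
    # log-softmax over each row; positives lie on the diagonal
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
aligned = token_contrastive_loss(emb, emb)                     # correct pairing
shuffled = token_contrastive_loss(emb, np.roll(emb, 1, axis=0))  # broken pairing
print(aligned < shuffled)  # aligned pairs incur a lower loss
```

Minimising such a loss across languages is what drives the source and target semantic spaces towards a more uniform layout.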
An Interpretability Framework for Similar Case Matching
Similar Case Matching (SCM) plays a pivotal role in the legal system by
facilitating the efficient identification of similar cases for legal
professionals. While previous research has primarily concentrated on enhancing
the performance of SCM models, the aspect of interpretability has been
neglected. To bridge this gap, this study proposes an integrated pipeline
framework for interpretable SCM. The framework comprises four modules: judicial
feature sentence identification, case matching, feature sentence alignment, and
conflict resolution. In contrast to current SCM methods, our framework first
extracts feature sentences within a legal case that contain essential
information. Then it conducts case matching based on these extracted features.
Subsequently, our framework aligns the corresponding sentences in two legal
cases to provide evidence of similarity. In instances where the results of case
matching and feature sentence alignment exhibit conflicts, the conflict
resolution module resolves these inconsistencies. The experimental results show
the effectiveness of our proposed framework, establishing a new benchmark for
interpretable SCM
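The feature-sentence alignment step can be sketched as a greedy cosine matching over sentence embeddings: each feature sentence in one case is paired with its most similar unmatched sentence in the other, and the pairs serve as evidence of similarity. The embeddings, threshold, and greedy strategy below are illustrative assumptions, not the framework's actual alignment module.

```python
import numpy as np

def align_sentences(embs_a, embs_b, threshold=0.5):
    """Greedily align feature sentences of two cases by cosine similarity.
    Returns (index_in_a, index_in_b, similarity) triples as evidence pairs."""
    a = embs_a / np.linalg.norm(embs_a, axis=1, keepdims=True)
    b = embs_b / np.linalg.norm(embs_b, axis=1, keepdims=True)
    sims = a @ b.T
    pairs, used = [], set()
    for i in range(len(a)):
        j = int(np.argmax(sims[i]))
        # keep only confident, one-to-one matches
        if sims[i, j] >= threshold and j not in used:
            pairs.append((i, j, float(sims[i, j])))
            used.add(j)
    return pairs

a = np.array([[1.0, 0.0], [0.0, 1.0]])  # toy sentence embeddings, case A
b = np.array([[0.0, 1.0], [1.0, 0.0]])  # toy sentence embeddings, case B
print(align_sentences(a, b))  # [(0, 1, 1.0), (1, 0, 1.0)]
```

A conflict-resolution stage would then reconcile such alignment evidence with the case-matching verdict when the two disagree.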
Model and Evaluation: Towards Fairness in Multilingual Text Classification
Recently, a growing body of research has focused on addressing bias in text
classification models. However, existing work mainly concerns the fairness of
monolingual text classification models, and research on fairness in
multilingual text classification is still very limited. In this paper, we
focus on the task of multilingual text classification and propose a debiasing
framework for multilingual text classification based on contrastive learning.
Our proposed method does not rely on any external language resources and can be
extended to other languages. The model contains four modules: multilingual
text representation module, language fusion module, text debiasing module, and
text classification module. The multilingual text representation module uses a
multilingual pre-trained language model to represent the text, the language
fusion module makes the semantic spaces of different languages tend to be
consistent through contrastive learning, and the text debiasing module uses
contrastive learning to make the model unable to identify sensitive attributes'
information. The text classification module completes the basic tasks of
multilingual text classification. In addition, existing research evaluates the
fairness of multilingual text classification in a relatively simple way: it
reuses the monolingual equality-difference evaluation, that is, fairness is
evaluated on a single language at a time. We propose a multi-dimensional fairness
evaluation framework for multilingual text classification, which evaluates the
model's monolingual equality difference, multilingual equality difference,
multilingual equality performance difference, and destructiveness of the
fairness strategy. We hope that our work can provide a more general debiasing
method and a more comprehensive evaluation framework for multilingual text
fairness tasks
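To make the evaluation dimensions concrete, the sketch below computes a per-language equality difference as the gap in false-positive rates between two demographic groups, then a cross-language spread as a toy stand-in for a multilingual equality difference. The FPR-gap definition and the spread aggregation are common formulations assumed here for illustration, not the paper's exact metrics.

```python
import numpy as np

def equality_difference(y_true, y_pred, group):
    """|FPR_a - FPR_b| between two demographic groups, one common
    'equality difference' style fairness metric (definition assumed)."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    def fpr(mask):
        neg = (y_true == 0) & mask  # true negatives eligible for a false positive
        return (y_pred[neg] == 1).mean() if neg.any() else 0.0
    return abs(fpr(group == 0) - fpr(group == 1))

# (y_true, y_pred, group) per language; monolingual ED for each,
# plus the spread across languages as a toy multilingual measure
langs = {
    "en": ([0, 0, 1, 1, 0, 0], [0, 1, 1, 1, 0, 0], [0, 0, 0, 1, 1, 1]),
    "de": ([0, 0, 1, 1, 0, 0], [0, 0, 1, 1, 0, 1], [0, 0, 0, 1, 1, 1]),
}
eds = {k: equality_difference(*v) for k, v in langs.items()}
multilingual_spread = max(eds.values()) - min(eds.values())
print(eds, multilingual_spread)
```

Evaluating both the per-language gaps and their cross-language spread is what separates a multilingual fairness audit from simply repeating the monolingual one per language.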
An Effective Deployment of Contrastive Learning in Multi-label Text Classification
The effectiveness of contrastive learning technology in natural language
processing tasks is yet to be explored and analyzed. How to construct positive
and negative samples correctly and reasonably is the core challenge of
contrastive learning. It is even harder to discover contrastive objects in
multi-label text classification tasks, for which very few contrastive losses
have previously been proposed. In this paper, we investigate the problem from a different
angle by proposing five novel contrastive losses for multi-label text
classification tasks. These are Strict Contrastive Loss (SCL), Intra-label
Contrastive Loss (ICL), Jaccard Similarity Contrastive Loss (JSCL), Jaccard
Similarity Probability Contrastive Loss (JSPCL), and Stepwise Label Contrastive
Loss (SLCL). We explore the effectiveness of contrastive learning for
multi-label text classification tasks by the employment of these novel losses
and provide a set of baseline models for deploying contrastive learning
techniques on specific tasks. We further perform an interpretable analysis of
our approach to show how different components of contrastive learning losses
play their roles. The experimental results show that our proposed contrastive
losses can bring improvement to multi-label text classification tasks. Our work
also explores how contrastive learning should be adapted for multi-label text
classification tasks.
Comment: Accepted by ACL-Findings 2023, 13 pages
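As a concrete illustration of the label-aware contrastive idea, the sketch below implements a Jaccard-weighted contrastive loss in which pairs sharing more labels are pulled together in proportion to the Jaccard similarity of their label sets. This is a hypothetical re-creation for intuition, not the paper's exact JSCL or JSPCL formulation; the temperature and weighting scheme are assumptions.

```python
import numpy as np

def jaccard(a, b):
    """Jaccard similarity of two binary label vectors."""
    a, b = np.asarray(a, bool), np.asarray(b, bool)
    union = (a | b).sum()
    return (a & b).sum() / union if union else 0.0

def jaccard_contrastive_loss(embs, labels, temperature=0.1):
    """Contrastive loss where the pull between samples i and j is weighted
    by the Jaccard similarity of their label sets (illustrative sketch)."""
    embs = np.asarray(embs, float)
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    sims = np.exp(embs @ embs.T / temperature)
    n, loss, pairs = len(embs), 0.0, 0
    for i in range(n):
        denom = sims[i].sum() - sims[i, i]      # exclude self-similarity
        for j in range(n):
            if i == j:
                continue
            w = jaccard(labels[i], labels[j])
            if w > 0:                            # only label-sharing pairs attract
                loss += -w * np.log(sims[i, j] / denom)
                pairs += 1
    return loss / max(pairs, 1)

labels = [[1, 1, 0], [1, 0, 0], [0, 0, 1]]       # samples 0 and 1 share a label
close = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]])  # 0 and 1 embedded nearby
far   = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]])  # 0 and 1 embedded apart
loss_close = jaccard_contrastive_loss(close, labels)
loss_far = jaccard_contrastive_loss(far, labels)
print(loss_close < loss_far)  # embedding label-sharing samples nearby lowers the loss
```

The graded weighting is what distinguishes this family of losses from a strict positive/negative split, which would have to discard pairs with partial label overlap.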
Boosting Neural Machine Translation with Dependency-Scaled Self-Attention Network
Syntax knowledge lends powerful strength to neural machine translation (NMT).
Early NMT work assumed that syntactic details could be learned automatically
from large amounts of text via attention networks. However, subsequent studies
pointed out that, limited by the uncontrolled nature of attention computation,
NMT models require external syntax to capture deep syntactic awareness.
Although existing syntax-aware NMT methods have borne great fruit in
incorporating syntax, the additional workload they introduce renders the models
heavy and slow. In particular, these efforts scarcely involve Transformer-based
NMT or modify its core self-attention network (SAN). To
this end, we propose a parameter-free, dependency-scaled self-attention network
(Deps-SAN) for syntax-aware Transformer-based NMT. A quantified matrix of
dependency closeness between tokens is constructed to impose explicit syntactic
constraints into the SAN for learning syntactic details and dispelling the
dispersion of attention distributions. Two knowledge sparsing techniques are
further integrated to prevent the model from overfitting the dependency noise
introduced by the external parser. Experiments and analyses on IWSLT14
German-to-English and WMT16 German-to-English benchmark NMT tasks verify the
effectiveness of our approach
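The core mechanism can be sketched as follows: a parameter-free Gaussian of dependency-tree distance biases the attention logits, so syntactically close tokens receive more weight and the attention distribution is less dispersed. The exact Deps-SAN formulation is not reproduced; the combination rule below (adding the log-prior to the logits) and the sigma value are assumptions for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def deps_scaled_attention(Q, K, V, dep_dist, sigma=1.0):
    """Self-attention whose logits are biased by a Gaussian of pairwise
    dependency-tree distance -- an illustrative dependency-scaled SAN."""
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d)
    # parameter-free syntactic prior: closer in the dependency tree -> larger weight
    prior = np.exp(-np.square(dep_dist) / (2 * sigma ** 2))
    attn = softmax(logits + np.log(prior + 1e-9))
    return attn @ V, attn

# with uniform content logits, attention follows the syntactic prior alone
n, d = 3, 4
Q = K = np.zeros((n, d))
V = np.eye(n)
dep_dist = np.array([[0, 1, 2], [1, 0, 1], [2, 1, 0]], float)
_, attn = deps_scaled_attention(Q, K, V, dep_dist)
print(attn[0, 1] > attn[0, 2])  # the dependency-adjacent token gets more weight
```

Because the prior is computed once from the parse and adds no trainable parameters, it constrains attention without the extra encoder machinery that makes many syntax-aware NMT models heavy and slow.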
A simple but effective method for Indonesian automatic text summarisation
Automatic text summarisation (ATS), which comprises two main approaches, abstractive and extractive summarisation, is an automatic procedure for extracting critical information from text using a specific algorithm or method. Due to the scarcity of corpora, abstractive summarisation performs poorly on low-resource-language ATS tasks, which is why researchers commonly apply extractive rather than abstractive summarisation to low-resource languages. As an emerging branch of extraction-based summarisation, methods based on feature analysis quantify the significance of information by calculating a utility score for each sentence in the article. In this study, we propose a simple but effective extractive method based on the Light Gradient Boosting Machine regression model for Indonesian documents. Four features are extracted: PositionScore, TitleScore, the semantic-representation similarity between the sentence and the document title, and the semantic-representation similarity between the sentence and its cluster centre. We define a formula for calculating the sentence score as the objective function of the linear regression. Considering the characteristics of Indonesian, we use Indonesian lemmatisation to improve the calculation of the sentence score. The results show that our method is more applicable
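The feature-scoring idea can be sketched in a few lines: compute simple per-sentence features such as a position score and a title-overlap score, combine them into a utility score, and rank sentences by it. The linear-decay position formula, the Jaccard title overlap, the hand-set weights, and the toy Indonesian sentences below are all assumptions for illustration; the paper learns the combination with a regression model rather than fixing weights by hand.

```python
def position_score(i, n):
    """Earlier sentences score higher; this linear decay is one common
    choice, not necessarily the paper's PositionScore formula."""
    return 1.0 - i / n

def title_score(sentence, title):
    """Word-overlap (Jaccard) between a sentence and the document title,
    standing in for the paper's TitleScore feature."""
    s, t = set(sentence.lower().split()), set(title.lower().split())
    return len(s & t) / len(s | t) if s | t else 0.0

def rank_sentences(sentences, title, w_pos=0.5, w_title=0.5):
    """Combine per-sentence features into a utility score and rank by it."""
    n = len(sentences)
    scores = [w_pos * position_score(i, n) + w_title * title_score(s, title)
              for i, s in enumerate(sentences)]
    return sorted(range(n), key=lambda i: -scores[i]), scores

title = "banjir di jakarta"  # hypothetical Indonesian headline: "flood in Jakarta"
sentences = [
    "banjir melanda jakarta pagi ini",
    "warga mengungsi ke tempat aman",
    "cuaca diperkirakan membaik besok",
]
order, scores = rank_sentences(sentences, title)
print(order[0])  # 0 -- the first sentence has the best position and title overlap
```

An extractive summary is then just the top-ranked sentences in document order; lemmatisation would normalise Indonesian affixed word forms before the overlap features are computed.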
Towards Malay named entity recognition: an open-source dataset and a multi-task framework
Named entity recognition (NER) is a key component of many natural language processing (NLP) applications. Most advanced research, however, has not been widely applied to low-resource languages such as Malay because of the data-hungry nature of modern models. In this paper, we present a system for building a Malay NER dataset (MS-NER) of 20,146 sentences through labelled datasets of homologous languages and iterative optimisation. Additionally, we propose a multi-task framework, named MTBR, to integrate boundary information more effectively for NER. Specifically, boundary detection is treated as an auxiliary task, and an enhanced bidirectional revision module with a gated ignoring mechanism is proposed to perform conditional label transfer, which reduces error propagation from the auxiliary task. We conduct extensive experiments on Malay, Indonesian, and English. Experimental results show that MTBR achieves competitive performance and tends to outperform multiple baselines. The constructed dataset and model will be made available to the public as a new, reliable benchmark for Malay NER
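The boundary-detection auxiliary task can be sketched by deriving entity-agnostic boundary labels from standard BIO NER tags; the exact labelling scheme MTBR uses is not specified here, so this collapse to B/I/O boundary tags is an assumption.

```python
def boundary_labels(ner_tags):
    """Collapse BIO NER tags to entity-agnostic boundary labels -- the kind
    of auxiliary signal a boundary-detection task supplies (scheme assumed)."""
    out = []
    for tag in ner_tags:
        if tag == "O":
            out.append("O")
        elif tag.startswith("B-"):
            out.append("B")  # entity start
        else:
            out.append("I")  # entity continuation
    return out

print(boundary_labels(["B-PER", "I-PER", "O", "B-LOC"]))  # ['B', 'I', 'O', 'B']
```

Because boundary labels discard entity types, they are cheaper to transfer across homologous languages, which is what makes boundary detection a natural auxiliary task in this low-resource setting.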